Unfortunately, there is no a priori knowledge to say which is the best model for this data set. When discussing how the uncertainty of a cluster model structure can be handled when using the K-means clustering algorithm, two measurements can be used for the discussion. Figure 2.28 shows one measurement, where a data set is composed of two clusters. The two clusters have the same centre-to-centre (or between-cluster) distance in all three scenarios, but they have different within-cluster variances in the three scenarios.
Here a term is defined as the sum of within-cluster variances, also called the within-cluster sum of squares; the same quantity is also referred to as the total within-cluster variance. It was 24,132, 15,831 and …5 in the three panels of Figure 2.28, i.e., panel (a), panel (b) and panel (c), respectively. Based on the comparison of these three scenarios, it can be seen that the discrimination power between the two clusters depends on the total within-cluster variance. A greater total within-cluster variance may lead to poorer discrimination power between the clusters, while a smaller total within-cluster variance may result in better discrimination power. Therefore, the data presented in Figure 2.28(a) may have the poorest discrimination power, or the worst clustering performance, while the data presented in Figure 2.28(b) may have the best clustering performance.
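The total within-cluster variance described above can be computed directly. The sketch below is illustrative only: the data, cluster labels and centres are synthetic assumptions (not taken from Figure 2.28), and it simply sums the squared distances from each point to its cluster centre, showing that noisier clusters yield a larger Sw.

```python
import numpy as np

def within_cluster_sum_of_squares(points, labels, centres):
    """Sum of squared distances from each point to its assigned cluster centre."""
    sw = 0.0
    for k, centre in enumerate(centres):
        members = points[labels == k]          # points assigned to cluster k
        sw += np.sum((members - centre) ** 2)  # squared deviations from the centre
    return sw

rng = np.random.default_rng(0)

# Two clusters with the same centre-to-centre distance but small spread
a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
b = rng.normal(loc=(10, 10), scale=0.5, size=(50, 2))
points = np.vstack([a, b])
labels = np.array([0] * 50 + [1] * 50)
centres = np.array([a.mean(axis=0), b.mean(axis=0)])
sw_tight = within_cluster_sum_of_squares(points, labels, centres)

# Same centres, but noisier clusters: the larger spread gives a larger Sw,
# i.e. poorer discrimination between the two clusters
a2 = rng.normal(loc=(0, 0), scale=3.0, size=(50, 2))
b2 = rng.normal(loc=(10, 10), scale=3.0, size=(50, 2))
points2 = np.vstack([a2, b2])
centres2 = np.array([a2.mean(axis=0), b2.mean(axis=0)])
sw_noisy = within_cluster_sum_of_squares(points2, labels, centres2)

print(sw_tight < sw_noisy)  # prints True
```

This mirrors the comparison in the text: holding the between-cluster distance fixed, increasing the within-cluster spread increases the total within-cluster variance.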
Figure 2.28 Three scenarios, shown in panels (a), (b) and (c), illustrating the impact of the total within-cluster sum of squares on the clustering performance. The dots stand for the data points and the triangles stand for the cluster centres. ‘Sw’ stands for the total within-cluster variance.
Figure 2.29 shows another measurement, where two clusters are presented in three panels. In all three panels, the two clusters have the same total within-cluster variance, but they have different between-